Exploratory Subgroup Identification with ForestSearch
German Breast Cancer Study Group (GBSG) Analysis
Author
Larry F. León
Published
January 31, 2026
1 Introduction
This vignette demonstrates the ForestSearch methodology for exploratory subgroup identification in survival analysis, as described in León et al. (2024) Statistics in Medicine.
1.1 Motivation
In clinical trials, particularly oncology, subgroup analyses are essential for:
Evaluating treatment effect consistency across patient populations
Identifying subgroups where treatment may be detrimental (harm)
Characterizing subgroups with enhanced benefit
Informing regulatory decisions and clinical practice
While prespecified subgroups provide stronger evidence, important subgroups based on patient characteristics may not be anticipated. ForestSearch provides a principled approach to exploratory subgroup identification with proper statistical inference.
1.2 Methodology Overview
ForestSearch identifies subgroups through:
Candidate factor selection: Using LASSO and/or Generalized Random Forests (GRF)
Exhaustive subgroup search: Evaluating all combinations up to maxk factors
Bootstrap bias correction: Adjusting for selection-induced optimism
Cross-validation: Assessing algorithm stability
The key innovation is the splitting consistency criterion: a subgroup is considered “consistent with harm” if, when randomly split 50/50 many times, both halves consistently show hazard ratios ≥ 1.0 (for example if 1.0 represents a meaningful “harm threshold”).
2 Setup
2.1 Load Required Packages
Show code
library(forestsearch)library(survival)library(data.table)library(ggplot2)library(gt)library(grf)library(policytree)library(doFuture)library(doRNG)# Optional packages for enhanced outputlibrary(patchwork)library(weightedsurv)# Set ggplot themetheme_set(theme_minimal(base_size =12))
3 Data: German Breast Cancer Study Group Trial
3.1 Study Background
The GBSG trial evaluated hormonal treatment (tamoxifen) versus chemotherapy in node-positive breast cancer patients. Key characteristics:
Sample size: N = 686
Outcome: Recurrence-free survival time
Censoring rate: ~56%
Treatment: Hormonal therapy (tamoxifen) vs. chemotherapy
cat("Subgroup size:", sum(fs$df.est$treat.recommend ==0), sprintf("(%.1f%% of ITT)\n", 100*mean(fs$df.est$treat.recommend ==0)))
Subgroup size: 82 (12.0% of ITT)
ForestSearch identifies Estrogen ≤ 0 (ER-negative) as the subgroup with potential harm. This is biologically plausible: tamoxifen is a selective estrogen receptor modulator with limited efficacy in ER-negative tumors.
6 Bootstrap Bias Correction
6.1 Rationale
Cox model estimates from identified subgroups are upwardly biased due to the selection process (subgroups are selected because they show extreme effects). Bootstrap bias correction addresses this by:
Resampling with replacement
Re-running the entire ForestSearch algorithm
Computing bias terms from bootstrap vs. observed estimates
# Number of bootstrap iterations# Use 500-2000 for production; reduced here for vignetteNB <-1000t0 <-proc.time()fs_bc <-forestsearch_bootstrap_dofuture(fs.est = fs,nb_boots = NB,show_three =TRUE,details =TRUE)
Ystar matrix generated should be 'boots x N': 1000 x 686
ForestSearch parameters for bootstrap iterations:
- sg_focus: hrMaxSG
- maxk: 2
- fs.splits: 400
- max_subgroups_search: 10
- hr.threshold: 1.25
- hr.consistency: 1
- pconsistency.threshold: 0.9
- n.min: 60
- use_twostage: TRUE
- use_lasso: TRUE
- use_grf: TRUE
Bootstrap-specific overrides:
- grf_res: NULL (forces re-selection)
- grf_cuts: NULL (forces re-selection)
- parallel_args: sequential (prevents nested parallelism)
- details: FALSE (suppressed in workers)
- plot.sg: FALSE
- plot.grf: FALSE
Note: stop_threshold reset to NULL for sg_focus = 'hrMaxSG'.
Early stopping disabled when HR is prioritized in selection;
Also consider increasing max_subgroups_search if search does not appear exhaustive.
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
return_selected_cuts_only = TRUE: using cuts from selected tree only
tau, maxdepth = 46.75811 2
leaf.node control.mean control.size control.se depth
1 2 6.81 93.00 3.14 1
2 3 -2.83 593.00 1.00 1
3 4 -4.19 120.00 2.55 2
4 5 5.06 144.00 1.88 2
5 6 10.40 65.00 3.58 2
6 7 -5.46 357.00 1.27 2
Selected subgroup:
leaf.node control.mean control.size control.se depth
5 6 10.40 65.00 3.58 2
GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "er <= 3"
Cuts from selected tree (depth = 2 ):
[1] "age <= 50" "age <= 45" "er <= 3"
GRF cuts identified: 3
Cuts: age <= 50, age <= 45, er <= 3
Selected tree depth: 2
# of continuous/categorical characteristics 5 2
Continuous characteristics: age size nodes pgr er
Categorical characteristics: meno grade3
## Prior to lasso: age size nodes pgr er
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
age 0.009841255
meno .
size .
grade3 0.040809356
nodes 0.064431931
pgr -0.001089501
er .
Cox-LASSO selected: age grade3 nodes pgr
Cox-LASSO not selected: meno size er
### End Lasso selection
## After lasso: age nodes pgr
Default cuts included from Lasso: age <= mean(age) age <= median(age) age <= qlow(age) age <= qhigh(age) nodes <= mean(nodes) nodes <= median(nodes) nodes <= qlow(nodes) nodes <= qhigh(nodes) pgr <= mean(pgr) pgr <= median(pgr) pgr <= qlow(pgr) pgr <= qhigh(pgr)
Categorical after Lasso: grade3
Factors per GRF: age <= 50 age <= 45 er <= 3
Initial GRF cuts included age <= 50 age <= 45 er <= 3
Factors included per GRF (not in lasso) age <= 50 age <= 45 er <= 3
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 16 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 16
Valid cuts: 16
Errors: 0
✓ All 16 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 16
[1] "age <= 50" "age <= 45" "er <= 3" "age <= 54.1" "age <= 54"
[6] "age <= 47" "age <= 62" "nodes <= 5.2" "nodes <= 3" "nodes <= 1"
[11] "nodes <= 7" "pgr <= 112.9" "pgr <= 37.5" "pgr <= 8" "pgr <= 134.2"
[16] "grade3"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 528
Events criteria: control >= 10 , treatment >= 10
Sample size criteria: n >= 60
Subgroup search completed in 0.02 minutes
--- Filtering Summary ---
Combinations evaluated: 528
Passed variance check: 484
Passed prevalence (>= 0.025 ): 484
Passed redundancy check: 454
Passed event counts (d0>= 10 , d1>= 10 ): 352
Passed sample size (n>= 60 ): 337
Cox model fit successfully: 337
Passed HR threshold (>= 1.25 ): 45
-------------------------
Found 45 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 45
Random seed set to: 8316951
Two-stage parameters:
n.splits.screen: 30
screen.threshold: 0.763
batch.size: 20
conf.level: 0.95
Removed 5 near-duplicate subgroups
# of unique initial candidates: 40
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 10
Parallel config: workers = 1 , batch_size = 2
Batch 1 / 5 : candidates 1 - 2
Batch 2 / 5 : candidates 3 - 4
Batch 3 / 5 : candidates 5 - 6
Batch 4 / 5 : candidates 7 - 8
Batch 5 / 5 : candidates 9 - 10
Evaluated 10 of 10 candidates (complete)
10 subgroups passed consistency threshold
SG focus = hrMaxSG
Seconds and minutes forestsearch overall = 4.825 0.0804
Consistency algorithm used: twostage
Note: stop_threshold reset to NULL for sg_focus = 'hrMaxSG'.
Early stopping disabled when HR is prioritized in selection;
Also consider increasing max_subgroups_search if search does not appear exhaustive.
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
return_selected_cuts_only = TRUE: using cuts from selected tree only
tau, maxdepth = 46.75811 2
leaf.node control.mean control.size control.se depth
1 2 1.66 85.00 3.10 1
2 3 -4.14 601.00 1.00 1
3 4 -6.89 142.00 2.40 2
4 5 6.76 73.00 2.03 2
5 6 2.78 80.00 2.22 2
6 7 -5.33 391.00 1.28 2
Selected subgroup:
leaf.node control.mean control.size control.se depth
4 5 6.76 73.00 2.03 2
GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "age <= 45"
Cuts from selected tree (depth = 2 ):
[1] "age <= 47" "age <= 45" "size <= 19"
GRF cuts identified: 3
Cuts: age <= 47, age <= 45, size <= 19
Selected tree depth: 2
# of continuous/categorical characteristics 5 2
Continuous characteristics: age size nodes pgr er
Categorical characteristics: meno grade3
## Prior to lasso: age size nodes pgr er
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
age -0.021898377
meno 0.553187113
size 0.015037354
grade3 0.056745121
nodes 0.055166642
pgr -0.003205087
er .
Cox-LASSO selected: age meno size grade3 nodes pgr
Cox-LASSO not selected: er
### End Lasso selection
## After lasso: age size nodes pgr
Default cuts included from Lasso: age <= mean(age) age <= median(age) age <= qlow(age) age <= qhigh(age) size <= mean(size) size <= median(size) size <= qlow(size) size <= qhigh(size) nodes <= mean(nodes) nodes <= median(nodes) nodes <= qlow(nodes) nodes <= qhigh(nodes) pgr <= mean(pgr) pgr <= median(pgr) pgr <= qlow(pgr) pgr <= qhigh(pgr)
Categorical after Lasso: meno grade3
Factors per GRF: age <= 47 age <= 45 size <= 19
Initial GRF cuts included age <= 47 age <= 45 size <= 19
Factors included per GRF (not in lasso) age <= 47 age <= 45 size <= 19
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 20 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 20
Valid cuts: 20
Errors: 0
✓ All 20 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 20
[1] "age <= 47" "age <= 45" "size <= 19" "age <= 52.8" "age <= 52"
[6] "age <= 61" "size <= 29.7" "size <= 25" "size <= 21" "size <= 35"
[11] "nodes <= 5" "nodes <= 3" "nodes <= 1" "nodes <= 7" "pgr <= 107.6"
[16] "pgr <= 29" "pgr <= 7" "pgr <= 116" "meno" "grade3"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 820
Events criteria: control >= 10 , treatment >= 10
Sample size criteria: n >= 60
Subgroup search completed in 0.04 minutes
--- Filtering Summary ---
Combinations evaluated: 820
Passed variance check: 766
Passed prevalence (>= 0.025 ): 766
Passed redundancy check: 731
Passed event counts (d0>= 10 , d1>= 10 ): 523
Passed sample size (n>= 60 ): 523
Cox model fit successfully: 523
Passed HR threshold (>= 1.25 ): 8
-------------------------
Found 8 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 8
Random seed set to: 8316951
Two-stage parameters:
n.splits.screen: 30
screen.threshold: 0.763
batch.size: 20
conf.level: 0.95
Removed 1 near-duplicate subgroups
# of unique initial candidates: 7
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 7
Parallel config: workers = 1 , batch_size = 2
Batch 1 / 4 : candidates 1 - 2
Batch 2 / 4 : candidates 3 - 4
Batch 3 / 4 : candidates 5 - 6
Batch 4 / 4 : candidates 7 - 7
Evaluated 7 of 7 candidates (complete)
1 subgroups passed consistency threshold
SG focus = hrMaxSG
Seconds and minutes forestsearch overall = 5.972 0.0995
Consistency algorithm used: twostage
Note: stop_threshold reset to NULL for sg_focus = 'hrMaxSG'.
Early stopping disabled when HR is prioritized in selection;
Also consider increasing max_subgroups_search if search does not appear exhaustive.
=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30
Maximum total splits: 400
Batch size: 20
================================================
GRF stage for cut selection with dmin, tau = 12 0.6
return_selected_cuts_only = TRUE: using cuts from selected tree only
tau, maxdepth = 45.06283 2
leaf.node control.mean control.size control.se depth
1 2 5.16 78.00 3.08 1
2 3 -4.22 608.00 0.97 1
3 4 -5.91 136.00 2.04 2
4 5 6.82 92.00 2.33 2
5 6 -6.08 357.00 1.29 2
6 7 1.84 101.00 2.45 2
Selected subgroup:
leaf.node control.mean control.size control.se depth
4 5 6.82 92.00 2.33 2
GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "age <= 44"
Cuts from selected tree (depth = 2 ):
[1] "age <= 47" "age <= 44" "age <= 63"
GRF cuts identified: 3
Cuts: age <= 47, age <= 44, age <= 63
Selected tree depth: 2
# of continuous/categorical characteristics 5 2
Continuous characteristics: age size nodes pgr er
Categorical characteristics: meno grade3
## Prior to lasso: age size nodes pgr er
#### Lasso selection results
7 x 1 sparse Matrix of class "dgCMatrix"
s0
age 0.002705392
meno .
size 0.005359394
grade3 0.259039181
nodes 0.044877257
pgr -0.002216517
er .
Cox-LASSO selected: age size grade3 nodes pgr
Cox-LASSO not selected: meno er
### End Lasso selection
## After lasso: age size nodes pgr
Default cuts included from Lasso: age <= mean(age) age <= median(age) age <= qlow(age) age <= qhigh(age) size <= mean(size) size <= median(size) size <= qlow(size) size <= qhigh(size) nodes <= mean(nodes) nodes <= median(nodes) nodes <= qlow(nodes) nodes <= qhigh(nodes) pgr <= mean(pgr) pgr <= median(pgr) pgr <= qlow(pgr) pgr <= qhigh(pgr)
Categorical after Lasso: grade3
Factors per GRF: age <= 47 age <= 44 age <= 63
Initial GRF cuts included age <= 47 age <= 44 age <= 63
Factors included per GRF (not in lasso) age <= 47 age <= 44 age <= 63
===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 20 cut expressions once and caching...
Cut evaluation summary:
Total cuts: 20
Valid cuts: 20
Errors: 0
✓ All 20 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====
# of candidate subgroup factors= 20
[1] "age <= 47" "age <= 44" "age <= 63" "age <= 52.6" "age <= 52"
[6] "age <= 46" "age <= 60.8" "size <= 29.5" "size <= 25" "size <= 21"
[11] "size <= 35" "nodes <= 5.1" "nodes <= 3" "nodes <= 1" "nodes <= 7"
[16] "pgr <= 104.9" "pgr <= 33" "pgr <= 7" "pgr <= 126" "grade3"
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 820
Events criteria: control >= 10 , treatment >= 10
Sample size criteria: n >= 60
Subgroup search completed in 0.05 minutes
--- Filtering Summary ---
Combinations evaluated: 820
Passed variance check: 760
Passed prevalence (>= 0.025 ): 760
Passed redundancy check: 720
Passed event counts (d0>= 10 , d1>= 10 ): 560
Passed sample size (n>= 60 ): 545
Cox model fit successfully: 545
Passed HR threshold (>= 1.25 ): 4
-------------------------
Found 4 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 4
Random seed set to: 8316951
Two-stage parameters:
n.splits.screen: 30
screen.threshold: 0.763
batch.size: 20
conf.level: 0.95
# of unique initial candidates: 4
# Restricting to top stop_Kgroups = 10
# of candidates to evaluate: 4
Parallel config: workers = 1 , batch_size = 2
Batch 1 / 2 : candidates 1 - 2
Batch 2 / 2 : candidates 3 - 4
Evaluated 4 of 4 candidates (complete)
2 subgroups passed consistency threshold
SG focus = hrMaxSG
Seconds and minutes forestsearch overall = 5.822 0.097
Consistency algorithm used: twostage
Kaplan-Meier survival curves by identified subgroup
Note: Identified subgroup: {er <= 0}. HR(bc) = bootstrap bias-corrected hazard ratio. Medians [95% CI] for arms are un-adjusted.
6.4.1 Event Count Summary
Low event counts can lead to unstable HR estimates. This summary helps identify potential issues:
Show code
# note that default required minimum events is 12 for subgroup candidate# Here we evaluate frequency of subgroup candidates in bootstrap samples less than 15event_summary <-summarize_bootstrap_events(fs_bc, threshold =15)
=== Bootstrap Event Count Summary ===
Total bootstrap iterations: 1000
Event threshold: <15 events
ORIGINAL Subgroup H on BOOTSTRAP samples:
Control arm <15 events: 0 (0.0%)
Treatment arm <15 events: 0 (0.0%)
Either arm <15 events: 0 (0.0%)
ORIGINAL Subgroup Hc on BOOTSTRAP samples:
Control arm <15 events: 0 (0.0%)
Treatment arm <15 events: 0 (0.0%)
Either arm <15 events: 0 (0.0%)
NEW Subgroups found: 874 (87.4%)
NEW Subgroup H* on ORIGINAL data:
Control arm <15 events: 23 (2.6% of successful)
Treatment arm <15 events: 113 (12.9% of successful)
Either arm <15 events: 128 (14.6% of successful)
NEW Subgroup Hc* on ORIGINAL data:
Control arm <15 events: 0 (0.0% of successful)
Treatment arm <15 events: 0 (0.0% of successful)
Either arm <15 events: 0 (0.0% of successful)
6.4.2 Bootstrap Diagnostics
Show code
# Quality metricssummaries$diagnostics_table_gt
Bootstrap Diagnostics Summary
Analysis of 1000 bootstrap iterations
Category
Metric
Value
Success Rate
Total iterations
1000
Successful
874 (87.4%)
Failed
126 (12.6%)
Success rating
Good
Subgroup H (Questionable)
Observed HR
1.951
Bias-corrected HR
1.528
Bootstrap CV (%)
99.3%
N estimates
874
Subgroup Hc (Recommend)
Observed HR
0.615
Bias-corrected HR
0.643
Bootstrap CV (%)
48.7%
N estimates
874
6.4.3 Subgroup Agreement
How consistently does bootstrap identify the same subgroup?
Show code
# Agreement with original analysisif (!is.null(summaries$subgroup_summary$original_agreement)) { summaries$subgroup_summary$original_agreement}
Metric Value
<char> <char>
1: Total bootstrap iterations 1000
2: Successful iterations 874
3: Failed iterations (no subgroup) 126
4:
5: Original subgroup definition {er <= 0}
6: Exact match with original 108 (12.4%)
7: Different from original 766 (87.6%)
8: Partial match (shared factor) 121 (13.8%)
Show code
# Factor presence across bootstrap iterationsif (!is.null(summaries$subgroup_summary$factor_presence)) { summaries$subgroup_summary$factor_presence}
The solid black line denotes the ITT Kaplan-Meier treatment difference estimates along with 95%95% CIs (the grey shaded region). K-M differences corresponding to subgroups are displayed.
Show code
# # Core display of ITT and identified subgroups# plot_km_band_forestsearch(# df = df.analysis,# fs.est = fs,# outcome.name = outcome.name,# event.name = event.name,# treat.name = treat.name# )# Add additional subgroups along with ITT and identified subgroupsref_sgs <-list(age_young =list(subset_expr ="age < 65", color ="brown"),age_old =list(subset_expr ="age >= 65", color ="orange"))plot_km_band_forestsearch(df = df.analysis,fs.est = fs,ref_subgroups = ref_sgs,outcome.name = outcome.name,event.name = event.name,treat.name = treat.name,draws_band =1000)
Show code
# # Example with more subgroups# ref_sgs <- list(# pgr_positive = list(subset_expr = "pgr > 0", color ="green"),# pgr_negative = list(subset_expr = "pgr <= 0", color = "purple"),# age_young = list(subset_expr = "age < 65", color = "brown"),# age_old = list(subset_expr = "age >= 65", color = "orange")# )# plot_km_band_forestsearch(# df = df.analysis,# fs.est = fs,# ref_subgroups = ref_sgs,# outcome.name = outcome.name,# event.name = event.name,# treat.name = treat.name# )
The ForestSearch analysis identifies estrogen receptor-negative (ER ≤ 0) patients as a subgroup with potential lack of benefit from hormonal therapy.
Biological plausibility: Tamoxifen is a selective estrogen receptor modulator. Its efficacy depends on ER expression. The finding that ER-negative patients may not benefit is consistent with:
Mechanistic understanding of tamoxifen action
Meta-analyses showing no tamoxifen benefit in ER-negative breast cancer
Clinical guidelines recommending tamoxifen primarily for ER-positive tumors
Caveats:
This is an exploratory analysis requiring independent validation
The bias-corrected estimates have wider confidence intervals
Cross-validation metrics should be evaluated for algorithm stability
León LF, Jemielita T, Guo Z, Marceau West R, Anderson KM (2024). “Exploratory subgroup identification in the heterogeneous Cox model: A relatively simple procedure.” Statistics in Medicine. DOI: 10.1002/sim.10163